Directed acyclic graphs are a fundamental class of networks that includes citation networks, food webs, and 
family trees, among others. Here we define a random graph model for directed acyclic graphs and give solutions 
for a number of the model's properties, including connection probabilities and component sizes, as well as a 
fast algorithm for simulating the model on a computer. We compare the predictions of the model to a real-world 
network of citations between physics papers and find surprisingly good agreement, suggesting that the structure 
of the real network may be quite well described by the random graph. 



Many networks of scientific interest take the form of directed acyclic graphs: directed networks
containing no closed cycles, i.e., paths that start and end at the same vertex and follow edges only
in the forward direction. The best known examples are citation networks [2], but there are many
others as well, such as family trees, phylogenetic networks, food webs, feed-forward neural networks,
and software call graphs. (Some of these are only approximately acyclic, but the approximation is
typically good enough that acyclic graphs still provide a useful starting point for theories of
network structure.)

One of the most fundamental and important of theoretical models in the study of networks is the
random graph. In its most general form, a random graph is a model network of a given number of
vertices in which certain topological features are fixed but in all other respects edges are placed
at random. Random graphs have significant advantages as models of networks, allowing one to isolate
the effects of particular structural parameters and being exactly solvable for many of their
topological properties, both local and global. They have played a central role in the development of
network theory, proving useful as a guide to both the qualitative and the quantitative properties of
networks of many kinds.

In this Letter, we present a random graph model for directed acyclic graphs. Despite the name
"acyclic graph," the lack of cycles is in fact not the defining feature of most real-world acyclic
graphs. The defining feature is that the vertices have a natural ordering. In a citation network of
scientific papers, for instance, the papers are time-ordered by publication date and the network is
acyclic because papers can only cite those that came before them, meaning that all edges point
backward in time. (Note that self-edges are not allowed in acyclic graphs.) It is clear that all
networks ordered in this way are acyclic, and it can be proved that for all acyclic networks at least
one appropriate ordering of the vertices exists. In practical situations, however, the ordering is
normally the crucial property and it will be the defining feature for the models described in this
paper.

Suppose then that we are given an ordered set of $n$ vertices denoted by $i = 1 \ldots n$ and a
corresponding degree sequence, i.e., a complete set of in- and out-degrees $k_i^{\mathrm{in}}$ and
$k_i^{\mathrm{out}}$ for all vertices. In our representation all edges will point from "later"
vertices (higher $i$) to "earlier" ones (lower $i$), as in a citation network. (Although we use the
language of time in this paper, the ordering does not have to be a time ordering. In a food web, for
example, the ordering represents trophic level.)

It is not possible to construct an acyclic network on every degree sequence. Degree sequences, for
instance, in which the first vertex has any outgoing edges ($k_1^{\mathrm{out}} > 0$) will not work
because there are no earlier vertices for those edges to attach to. More generally, all edges
outgoing from vertices 1 to $i$ must attach at their other end to vertices in the range 1 to $i - 1$,
and hence a necessary condition on the degree sequence is
$\sum_{j=1}^{i-1} k_j^{\mathrm{in}} \ge \sum_{j=1}^{i} k_j^{\mathrm{out}}$ for all $i$, with the
inequality becoming an equality for $i = 1$ and $i = n$. Defining the useful quantity

$$
\lambda_i = \sum_{j=1}^{i-1} k_j^{\mathrm{in}} - \sum_{j=1}^{i} k_j^{\mathrm{out}},
\tag{1}
$$

this condition can also be written as $\lambda_i \ge 0$ for $i = 2 \ldots n - 1$ and
$\lambda_1 = \lambda_n = 0$. It is straightforward to prove that this is also a sufficient condition
for a degree sequence to be realizable as a network. Physically, $\lambda_i$ represents the number of
edges that go around vertex $i$, meaning the number that connect vertices later than $i$ to vertices
earlier than $i$.
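The realizability condition can be checked in a single pass over the degree sequence. The following is a minimal sketch, assuming 0-indexed Python lists `k_in` and `k_out` ordered from the earliest to the latest vertex. The function names `lambdas` and `is_realizable` are illustrative only, and the explicit check that the total in- and out-degrees agree is an added assumption that the text leaves implicit.

```python
def lambdas(k_in, k_out):
    """Return [lambda_1, ..., lambda_n] for a degree sequence ordered earliest to latest."""
    lam = []
    in_before = 0   # sum of in-degrees of vertices strictly earlier than the current one
    out_upto = 0    # sum of out-degrees of vertices up to and including the current one
    for kin, kout in zip(k_in, k_out):
        out_upto += kout
        lam.append(in_before - out_upto)
        in_before += kin
    return lam


def is_realizable(k_in, k_out):
    """Check lambda_1 = lambda_n = 0 and lambda_i >= 0, plus equal stub totals (assumed)."""
    if len(k_in) != len(k_out) or not k_in:
        return False
    lam = lambdas(k_in, k_out)
    return (
        sum(k_in) == sum(k_out)
        and lam[0] == 0
        and lam[-1] == 0
        and all(x >= 0 for x in lam)
    )
```

For example, `lambdas([2, 1, 0], [0, 1, 2])` gives `[0, 1, 0]`: one edge passes around the middle vertex, and none can pass around the first or last.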

We can visualize the degree sequence as a set of edge "stubs," outgoing and ingoing, attached in the
appropriate numbers to each vertex. Our job is to match these stubs in pairs to create directed
edges. Our definition of a random graph for directed acyclic networks is analogous to that of the
standard "configuration model" for undirected networks [5, 6, 7]: it is the graph generated by
drawing uniformly at random from all allowed matchings of the stubs, where "allowed" in this case
means matchings that respect the ordering of the vertices. More correctly, it is the ensemble of such
matchings in which each matching appears with equal probability. Note that, as in other random graph
models, multiedges are allowed, although in general they constitute a fraction only $O(1/n)$ of all
edges and hence are usually negligible.

An attractive feature of this model is that there turns out to be a simple and efficient algorithm
for generating the networks. Previous numerical schemes for generating acyclic graphs have relied on
Monte Carlo techniques, which are effective but slow. Our model, by contrast, allows a simple
constructive algorithm: starting with no edges in our network, we go through each vertex in time
order and attach each outgoing stub to an ingoing stub at an earlier vertex, chosen uniformly at
random from the set of such stubs that are currently unattached. With a suitable choice of data
structures this algorithm runs in time of order the number of edges in the network. In practice, we
can easily generate networks of up to a few billion vertices in reasonable running times.
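The constructive algorithm can be sketched in a few lines. The data structure used here is an assumption rather than something the text specifies: a flat list of unattached in-stubs with swap-and-pop removal, which makes each attachment O(1) and the whole run linear in the number of edges. The function name and the `seed` parameter are hypothetical.

```python
import random


def random_acyclic_graph(k_in, k_out, seed=None):
    """Draw one network from the model; returns a list of (later, earlier) vertex pairs.

    k_in and k_out are 0-indexed degree lists ordered from earliest to latest vertex.
    """
    rng = random.Random(seed)
    free_in = []   # unattached in-stubs, each recorded as the index of its vertex
    edges = []
    for j in range(len(k_in)):
        for _ in range(k_out[j]):
            if not free_in:
                raise ValueError("degree sequence is not realizable")
            # Pick an unattached earlier in-stub uniformly at random, then remove it
            # in O(1) by swapping it to the end of the list and popping.
            pos = rng.randrange(len(free_in))
            free_in[pos], free_in[-1] = free_in[-1], free_in[pos]
            edges.append((j, free_in.pop()))
        # Vertex j's own in-stubs become available only to later vertices,
        # so self-edges cannot occur.
        free_in.extend([j] * k_in[j])
    if free_in:
        raise ValueError("degree sequence is not realizable")
    return edges
```

Adding a vertex's in-stubs only after its out-stubs have been attached is what enforces the ordering: every edge produced points from a later vertex to a strictly earlier one.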

It may not be immediately clear that this algorithm generates networks with the same probabilities as
the model defined above, but it is easily proved. Consider the step of the algorithm in which we
choose the destinations of the $k_i^{\mathrm{out}}$ outgoing stubs at vertex $i$. At the start of
this step, the number of unused ingoing stubs at earlier vertices is
$\sum_{j=1}^{i-1} k_j^{\mathrm{in}} - \sum_{j=1}^{i-1} k_j^{\mathrm{out}} = \lambda_i + k_i^{\mathrm{out}}$,
and the number of distinct matchings of $i$'s outgoing stubs to these ingoing ones is
$N_i = (\lambda_i + k_i^{\mathrm{out}})!/\lambda_i!$, each of which has the same probability $1/N_i$
of being chosen. Thus the total probability of generating a specific matching for the whole network
is $\prod_{i=2}^{n} (1/N_i)$, which is clearly uniform over all matchings, as required, since it
depends only on the degree sequence and not on the matching itself.
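The counting argument can be checked on small examples. The sketch below compares the product of the per-vertex counts $(\lambda_i + k_i^{\mathrm{out}})!/\lambda_i!$ with a brute-force enumeration of all stub matchings that respect the vertex ordering. Both function names are hypothetical, and the brute-force routine is only feasible for a handful of edges.

```python
from itertools import permutations
from math import factorial


def count_matchings_formula(k_in, k_out):
    """Product over vertices of (lambda_i + k_i^out)! / lambda_i!."""
    total = 1
    in_before = out_before = 0
    for kin, kout in zip(k_in, k_out):
        unused = in_before - out_before   # lambda_i + k_i^out: free in-stubs at earlier vertices
        lam = unused - kout               # lambda_i
        if lam < 0:
            return 0                      # not enough earlier in-stubs: no allowed matching
        total *= factorial(unused) // factorial(lam)
        in_before += kin
        out_before += kout
    return total


def count_matchings_brute_force(k_in, k_out):
    """Enumerate all pairings of distinct stubs; keep those pointing later -> earlier."""
    in_stubs = [v for v, k in enumerate(k_in) for _ in range(k)]
    out_stubs = [v for v, k in enumerate(k_out) for _ in range(k)]
    if len(in_stubs) != len(out_stubs):
        return 0
    return sum(
        all(i < j for i, j in zip(perm, out_stubs))
        for perm in permutations(in_stubs)
    )
```

Since the algorithm reaches each allowed matching by exactly one sequence of choices, the reciprocal of this count is the probability of any single matching.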

Having defined our model and a method for drawing from its ensemble, we turn to the calculation of
its properties. Our first goal is to find one of the most fundamental of network quantities, the
probability of connection between a given pair of vertices, or more correctly the expected number of
edges between them. Let us define $f_{ij}$ to be the probability of connection between a given
in-stub at vertex $i$ and a given out-stub at vertex $j$, multiplied by the total number $m$ of edges
in the network. The stub connection probability is equal to the number of complete matchings in which
these particular stubs are connected divided by the total number of matchings. Assuming $i < j$, this
gives



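$$
f_{ij} = m \, \frac{\prod_{l=i+1}^{j-1} \lambda_l}{\prod_{l=i+1}^{j} \bigl(\lambda_l + k_l^{\mathrm{out}}\bigr)}.
\tag{2}
$$

Then the expected number $P_{ij}$ of edges between $i$ and $j$ is

$$
P_{ij} = \frac{k_i^{\mathrm{in}} k_j^{\mathrm{out}}}{m} \, f_{ij}.
\tag{3}
$$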



Note that in an ordinary (cyclic) directed random graph the expected number of edges between two
vertices is $k_i^{\mathrm{in}} k_j^{\mathrm{out}}/m$ and hence $f_{ij}$ is the factor by which that
number is modified in our acyclic model.
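One way to evaluate $f_{ij}$ numerically is to follow the constructive algorithm step by step. This is a sketch under that assumption, not necessarily the form used in the original: a given in-stub at $i$ stays unattached through each intermediate vertex $l$ with probability $\lambda_l/(\lambda_l + k_l^{\mathrm{out}})$, and a given out-stub at $j$ then picks it with probability $1/(\lambda_j + k_j^{\mathrm{out}})$. The function names are hypothetical, and exact fractions are used so that normalization can be checked without rounding error.

```python
from fractions import Fraction


def stub_factors(k_in, k_out):
    """Return {(i, j): f_ij} for all i < j (0-indexed), following the sequential matching."""
    n, m = len(k_in), sum(k_out)
    avail, lam = [], []
    in_before = out_before = 0
    for j in range(n):
        avail.append(in_before - out_before)   # lambda_j + k_j^out: free in-stubs before j
        lam.append(avail[j] - k_out[j])        # lambda_j
        in_before += k_in[j]
        out_before += k_out[j]
    f = {}
    for i in range(n):
        survive = Fraction(1)   # probability a given in-stub at i is still unattached
        for j in range(i + 1, n):
            if avail[j] == 0:
                f[i, j] = Fraction(0)          # no free in-stubs, so no connection possible
                continue
            f[i, j] = m * survive / avail[j]
            survive *= Fraction(lam[j], avail[j])
    return f


def expected_edges(k_in, k_out):
    """Expected number of edges P_ij = k_i^in k_j^out f_ij / m for all i < j."""
    m = sum(k_out)
    return {
        (i, j): k_in[i] * k_out[j] * fij / m
        for (i, j), fij in stub_factors(k_in, k_out).items()
    }
```

A useful sanity check is that the expected edge counts into all earlier vertices sum to the out-degree of the later vertex, since every out-stub must attach somewhere.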

By suitable manipulation, Eq. (2) can be rewritten as a product of independent functions of $i$ and
$j$: $f_{ij} = f_{1n} a_i b_j$, with $a_1 = b_n = 1$ and

$$
a_i = \prod_{l=2}^{i} \Bigl(1 + \frac{k_l^{\mathrm{out}}}{\lambda_l}\Bigr),
\qquad
b_j = \prod_{l=j}^{n-1} \Bigl(1 + \frac{k_l^{\mathrm{in}}}{\lambda_l}\Bigr)
\tag{4}
$$

for all other $i, j$. This reduces the calculation of $P_{ij}$ to the calculation of just $O(n)$
quantities, and for numerical purposes this is the quickest way to evaluate $P_{ij}$. Equation (4)
also has the virtue of being manifestly symmetric with respect to in- and out-degrees (by contrast
with Eq. (2)).
FIG. 1: Comparison of empirical measurements (jagged lines) and analytic predictions (curves) of
$f_{ij}$ for the citation network described in the text. The "time" of paper $i$ is defined to be
$t = i/n$. Left: $f_{ij}$ for citations to papers at time 0.1 (dotted line) from later times $t$.
Right: $f_{ij}$ for citations from papers at time 0.9 to earlier times $t$.



As a demonstration of the application of the model, we show in Fig. 1 a comparison of our theoretical
predictions for $f_{ij}$ with measured values for a citation network consisting of $n = 27221$
physics papers on high-energy theory posted on the Physics E-print Archive at arxiv.org between 1992
and 2003. We study $f_{ij}$ rather than $P_{ij}$ since the latter is strongly dependent on the
degrees of individual vertices, via Eq. (3), making it a noisy function of its indices. By contrast,
$f_{ij}$ has only a weak dependence on individual degrees and is relatively smooth. We estimate
$f_{ij}$ for the observed network by counting the number of edges running between two windows of
width 200 vertices centered on $i$ and $j$, dividing by the number of in-stubs in the first window
and out-stubs in the second, and multiplying by $m$.
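The windowed estimate described above might be implemented along the following lines. This is a sketch only: the edge-list format, the function name, and the way windows are clipped at the ends of the vertex range are assumptions, not details given in the text.

```python
def window_estimate(edges, k_in, k_out, i, j, width=200):
    """Estimate f_ij from an observed network.

    edges is a list of (source, target) pairs with target earlier than source;
    k_in and k_out are the observed degrees, indexed by vertex in time order.
    """
    n, m = len(k_in), len(edges)

    def window(center):
        lo = center - width // 2
        # Clip the window to the valid vertex range (an assumption about edge handling).
        return range(max(0, lo), min(n, lo + width))

    win_i, win_j = window(i), window(j)
    # Edges whose cited vertex lies in the first window and citing vertex in the second.
    count = sum(1 for src, dst in edges if dst in win_i and src in win_j)
    in_stubs = sum(k_in[v] for v in win_i)
    out_stubs = sum(k_out[v] for v in win_j)
    if in_stubs == 0 or out_stubs == 0:
        return float("nan")   # no stubs in one of the windows: estimate undefined
    return m * count / (in_stubs * out_stubs)
```

If both windows cover the whole network the estimate is exactly 1, the value for an ordinary directed random graph, which makes a convenient check of the normalization.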

As the figure shows, theory and observation are in remarkably good agreement in this case,
indicating that the edge probabilities are, at least on average, not far from those of the random
graph. A normal (not acyclic) random directed graph [7], sometimes used as a crude model for acyclic
networks, would have $f_{ij} = 1$ for all $i, j$ (a perfectly horizontal line in the figure), which
would be entirely incompatible with the observations. (Other models, particularly preferential
attachment models [12, 13], make quite good models of citation networks, but our model is more
general, being applicable also to many other acyclic networks for which preferential attachment is
not a good match.)

To make further progress it is convenient to consider, as with other random graph models, the
behavior of the model in the limit of large network size. Let us define a "time" variable
$t \in (0, 1]$ such that the time of vertex $i$ is $t = i/n$, and let $\kappa^{\mathrm{in}}(t)$ and
$\kappa^{\mathrm{out}}(t)$ be the densities of ingoing and outgoing edges over time, meaning that
$\kappa^{\mathrm{in}}(t)\,dt$ is the fraction of ingoing edges in the interval $t$ to $t + dt$, and
similarly for $\kappa^{\mathrm{out}}(t)$. By analogy with earlier developments we also define

$$
\lambda(t) = \int_0^t \bigl[\kappa^{\mathrm{in}}(t') - \kappa^{\mathrm{out}}(t')\bigr]\,dt',
\tag{5}
$$



and we define $f(t,u)$ to be $m$ times the probability that an in-stub at time $t$ is connected to an
out-stub at time $u$. Then, taking $n \to \infty$ in Eq. (4) and assuming that $\lambda_i$ is large
compared to individual degrees, we find that $f(t,u) = f(0,1)\,a(t)\,b(u)$, where

$$
a(t) = \exp\left[\int_0^t \frac{\kappa^{\mathrm{out}}(t')}{\lambda(t')}\,dt'\right],
\qquad
b(u) = \exp\left[\int_u^1 \frac{\kappa^{\mathrm{in}}(t')}{\lambda(t')}\,dt'\right].
\tag{6}
$$

Since every out-stub must connect to some in-stub, $f(t,u)$ must also satisfy the normalization
condition $\int_0^u \kappa^{\mathrm{in}}(t)\,f(t,u)\,dt = 1$. Substituting for $f(t,u)$ from above
and setting $u = 1$ then gives

$$
f(0,1) \int_0^1 \kappa^{\mathrm{in}}(t)\,a(t)\,dt = 1,
\tag{7}
$$

which allows us to determine the overall normalization of $f(t,u)$. If we wish we can also translate
these results back into the language of individual vertices and write the probability of connection
between vertices $i$ and $j$ as $P_{ij} = k_i^{\mathrm{in}} k_j^{\mathrm{out}} f(i/n, j/n)/m$.

As an example, consider a random acyclic graph with

$$
\kappa^{\mathrm{in}}(t) = 2(1 - t), \qquad \kappa^{\mathrm{out}}(u) = 2u.
\tag{8}
$$

Using the formulas above, we then find that

$$
f(t,u) = \frac{1}{2(1 - t)\,u}.
\tag{9}
$$

Note that this diverges at $t = 1$ and $u = 0$, as it should: the probability of connection between
an out-stub at time $u$ and an earlier in-stub becomes large when $u$ approaches zero because the
number of earlier in-stubs is small (and similarly when $t$ is large).
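The intermediate steps for this example can be filled in as follows. This is a worked sketch that assumes the large-size formulas take the forms $\lambda(t) = \int_0^t [\kappa^{\mathrm{in}} - \kappa^{\mathrm{out}}]\,dt'$, $a(t) = \exp\int_0^t (\kappa^{\mathrm{out}}/\lambda)\,dt'$, $b(u) = \exp\int_u^1 (\kappa^{\mathrm{in}}/\lambda)\,dt'$, and $f(0,1) \int_0^1 \kappa^{\mathrm{in}}(t)\,a(t)\,dt = 1$.

```latex
\begin{align*}
\lambda(t) &= \int_0^t \bigl[2(1-t') - 2t'\bigr]\,dt' = 2t(1-t), \\
a(t) &= \exp\left[\int_0^t \frac{2t'}{2t'(1-t')}\,dt'\right]
      = \exp\bigl[-\ln(1-t)\bigr] = \frac{1}{1-t}, \\
b(u) &= \exp\left[\int_u^1 \frac{2(1-t')}{2t'(1-t')}\,dt'\right]
      = \exp\bigl[-\ln u\bigr] = \frac{1}{u}, \\
f(0,1) &= \left[\int_0^1 2(1-t)\,\frac{1}{1-t}\,dt\right]^{-1} = \frac{1}{2}, \\
f(t,u) &= f(0,1)\,a(t)\,b(u) = \frac{1}{2(1-t)\,u}.
\end{align*}
```

The two divergences are visible in the separate factors: $a(t)$ blows up as $t \to 1$ and $b(u)$ as $u \to 0$.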

The probability of connection between vertices, on the other hand, does not diverge. Multiplying
Eq. (9) by $k_i^{\mathrm{in}} k_j^{\mathrm{out}}/m$ with $i = nt$, $j = nu$, averaging over the
distributions of the degrees, and noting that the average in- and out-degrees at time $t$ are
$c\kappa^{\mathrm{in}}(t)$ and $c\kappa^{\mathrm{out}}(t)$, where $c = m/n$ is the average degree (in
or out) of the network as a whole, we get

$$
\frac{c\kappa^{\mathrm{in}}(t)\,c\kappa^{\mathrm{out}}(u)}{m}\,f(t,u)
= \frac{2c(1 - t) \times 2cu}{2m(1 - t)\,u}
= \frac{2c}{n},
\tag{10}
$$



which is constant. Thus all pairs of vertices are equally likely to be connected. In fact, this case
is closely related to the so-called cascade model, an acyclic graph model used in the study of food
webs [10]. The cascade model also has constant probabilities of connection between vertices, and
moreover it can be shown that all networks with a given degree sequence appear with the same
probability in the cascade model, so that the set of such networks is a random acyclic graph in our
sense [11].

As another example, consider the widely studied class of networks generated by linear preferential
attachment processes [12, 13]. Because of the inherent time-ordering of their vertices, these
processes generate directed acyclic graphs and are commonly used as a simple model for citation
networks, among other things [15].

For a preferential attachment model in which each vertex created has out-degree $c$ and attachment is
proportional to $k^{\mathrm{in}} + r$ with $c$ and $r$ constants, the mean in-degree as a function of
time is $r\bigl(t^{-c/(c+r)} - 1\bigr)$ [14, 15]. Consider a random acyclic graph with the same in-
and out-degrees. Using the formulas above, we find that

$$
f(t,u) = \frac{1}{(1 + r/c)\bigl(1 - t^{c/(c+r)}\bigr)\,u^{r/(c+r)}},
\tag{11}
$$

which again diverges at $t = 1$ and $u = 0$. The average probability of connection between vertices
$i$ and $j$ is then

$$
P_{ij} = \frac{rc}{c + r}\, i^{-c/(c+r)}\, j^{-r/(c+r)}.
\tag{12}
$$



Remarkably, this is precisely the connection probability for the original preferential attachment
model itself [15]. Indeed it can be shown, as with the cascade model, that networks with a given
degree sequence occur with uniform probability in the preferential attachment model, and hence form a
random acyclic graph according to our definition of the term. It is sometimes claimed that graphs
generated by the preferential attachment process are not truly random, since they contain
correlations of various kinds. Our results indicate, however, that, when one correctly accounts for
the time ordering of the vertices, the preferential attachment model is in fact simply a random
graph.

There are many other properties that can be computed for our model. Consider, for example, the number
of paths between vertices in the network. Let $D_{ij}$ be the expected number of directed paths from
$j$ to $i$. Since every such path consists either of just a single edge from $j$ to $i$ or of a path
from $j$ to some intermediate vertex $v$ and then an edge from $v$ to $i$, we can write


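$$
D_{ij} = P_{ij} + \sum_{v=i+1}^{j-1} P_{iv} D_{vj}.
\tag{13}
$$

After some computation, we then find that

$$
D_{ij} = P_{ij} \prod_{v=i+1}^{j-1} \bigl[1 + P_{vv}\bigr],
\tag{14}
$$

where $P_{vv} = k_v^{\mathrm{in}} k_v^{\mathrm{out}} f_{1n} a_v b_v/m$ is the formal extension of the
expression for $P_{ij}$ to $i = j = v$.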

When $D_{ij}$ is small, so that the probability of having more than one path is negligible, $D_{ij}$
can be treated as the probability that a path exists. Within this "tree-like" regime, we can compute
various quantities of interest starting from the expression for $D_{ij}$. For instance, let $s_j$ be
the average size of the out-component reachable from vertex $j$, i.e., the total number of papers
cited directly or indirectly by $j$ in the language of citation networks. Then
$s_j = 1 + \sum_{i<j} D_{ij}$, which can be evaluated explicitly in the large graph size limit. For
the case of a cascade-type model obeying Eq. (8), for example, this expression gives
$s(t) = e^{2ct}$, increasing exponentially with time and largest for the last vertex in the network.
The tree-like assumption breaks down if $D_{ij} > O(1/n)$ or equivalently if the sizes of
out-components approach the size of the entire network. For the cascade model this happens if
$e^{2c} \sim n$, or equivalently $c \sim \tfrac{1}{2} \ln n$. Hence this breakdown is effectively a
finite-size effect: in the limit of large $n$ it is never observed. For other choices of degrees,
however, the assumption of tree-like components can break down even in the large $n$ limit. The
preferential-attachment-type network is an example of this; here the assumption breaks down at
$c = 1$. Figure 2 shows a comparison of simulations and theory for both cases as a function of $c$.
Agreement is excellent until we approach the expected breakdown point, at which simulation and theory
diverge significantly.

FIG. 2: Expected size of the out-component for the last ($t = 1$) vertex in a graph, measured as a
fraction of system size. Solid lines represent the theoretical predictions. Points represent
numerical results, averaged over 8000 graphs. Top: networks with the degree distribution of the
cascade model. Inset: an enlargement of the leftmost portion of the curve, showing the agreement
between theory and simulation in this region. Bottom: networks with the degree distribution of the
preferential attachment model with r = kc.
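The path-counting argument lends itself to a direct numerical check. The sketch below assumes the recursion implied by the verbal description, $D_{ij} = P_{ij} + \sum_{v} P_{iv} D_{vj}$ with $i < v < j$, and treats edge and path counts as independent. The function names and the callable `P(i, j)` interface are hypothetical.

```python
import math


def path_counts_to_earlier(P, j):
    """Expected number of directed paths from vertex j to each earlier vertex i < j.

    P(i, v) must return the expected number of edges from the later vertex v to the
    earlier vertex i. Vertices are 0-indexed in time order.
    """
    D = [0.0] * j
    # Work backwards from the vertex just before j: D[i] needs D[v] for i < v < j.
    for i in range(j - 1, -1, -1):
        D[i] = P(i, j) + sum(P(i, v) * D[v] for v in range(i + 1, j))
    return D


def out_component_size(P, j):
    """Tree-like estimate s_j = 1 + sum of D_ij over earlier vertices i."""
    return 1.0 + sum(path_counts_to_earlier(P, j))


def cascade_edge_probability(n, c):
    """Constant connection probability 2c/n of the cascade-type example."""
    return lambda i, j: 2.0 * c / n
```

For the cascade-type case with $n = 1000$ and $c = 1$, `out_component_size(cascade_edge_probability(1000, 1.0), 999)` agrees with $e^{2c}$ to within about one percent, the remaining difference being a finite-size correction. The recursion costs $O(j^2)$ operations per target vertex, so it is meant for checking the analytic result rather than for large networks.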

In conclusion, we have proposed a random graph model for directed acyclic graphs, a large and
important class that describes many real-world networks. We have defined the model for arbitrary
degree sequences, given a fast algorithm for generating networks drawn from the model, and shown that
a variety of the model's properties can be calculated exactly, both at finite sizes and in the limit
of large network size. Just as ordinary undirected and directed random graphs have played many roles
in the development of network theory, so the acyclic equivalent should prove useful in the study of
acyclic networks, providing an analytically tractable model for structural network properties, a
starting point for more complex analytic or numerical models, a null model for statistical
comparisons, and, we hope, other applications not yet envisioned.